CONDITIONS FOR RAPID MIXING OF PARALLEL AND
SIMULATED TEMPERING ON MULTIMODAL DISTRIBUTIONS

By Dawn B. Woodard, Scott C. Schmidler and Mark Huber

Duke University 

We give conditions under which a Markov chain constructed via 
parallel or simulated tempering is guaranteed to be rapidly mixing, 
which are applicable to a wide range of multimodal distributions 
arising in Bayesian statistical inference and statistical mechanics. We 
provide lower bounds on the spectral gaps of parallel and simulated 
tempering. These bounds imply a single set of sufficient conditions for 
rapid mixing of both techniques. A direct consequence of our results 
is rapid mixing of parallel and simulated tempering for several normal 
mixture models, and for the mean-field Ising model. 

1. Introduction. Stochastic sampling methods have become ubiquitous 
in statistics, computer science and statistical physics. When independent 
samples from a target distribution are difficult to obtain, a widely applicable
alternative is to construct a Markov chain having the target distribution as 
its limiting distribution. Sample paths from simulating the Markov chain 
then yield laws of large numbers and often central limit theorems [20, 26], 
and thus are widely used for Monte Carlo integration and approximate 
counting. Such Markov chain Monte Carlo (MCMC) methods have revolutionized
computation in Bayesian statistics [9], provided significant breakthroughs
in theoretical computer science [10] and become a staple of physical
simulations [2, 18].

A common difficulty arising in the application of MCMC methods is that 
many target distributions arising in statistics and statistical physics are 
strongly multimodal; in such cases, the Markov chain can take an impracti- 
cally long time to reach stationarity. Since the most commonly used MCMC 
algorithms construct reversible Markov chains, or can be made reversible 
without significant alteration, the convergence rate is bounded by the spec- 
tral gap of the transition operator (kernel). A variety of techniques have 
been developed to obtain bounds on the spectral gap of reversible Markov
chains [6, 12, 24, 25]. For multimodal or other target distributions where the 
state space can be partitioned into high probability subsets between which 
the kernel rarely moves, the spectral gap will be small. 

Two of the most popular and empirically successful MCMC algorithms for 
multimodal problems are Metropolis-coupled MCMC or parallel tempering 
[7] and simulated tempering [8, 17]. Adequate theoretical characterization 
of chains constructed in such a manner is therefore of significant interest. 
Toward this end, Zheng [29] bounds the spectral gap of simulated temper- 
ing below by a multiple of the spectral gap of parallel tempering, with a 
multiplier depending on a measure of overlap between distributions at adja- 
cent temperatures. Madras and Piccioni [13] analyze a variant of simulated 
tempering as a mixture of the component chains at each temperature. 

Madras and Randall [14] develop decomposition theorems for bounding 
the spectral gap of a Markov chain, then use those theorems to bound the 
mixing (equiv., convergence) of simulated tempering in terms of the slowest 
mixing of the tempered chains. If Metropolis-Hastings mixes slowly on the 
original (untempered) distribution, their bound cannot be used to show 
rapid mixing of simulated tempering. 

However, rapid mixing of simulated tempering has been shown for sev- 
eral specific multimodal distributions for which local Metropolis-Hastings 
mixes slowly. Madras and Zheng [16] bound the spectral gap of parallel and 
simulated tempering on two examples, the "exponential valley" density and 
the mean-field Ising model. They use the decomposition theorems of [14].
However, unlike [14], they decompose the state spaces of their examples into 
two symmetric halves. Then they bound the mixing of parallel and simu- 
lated tempering in terms of the mixing of Metropolis-Hastings within each 
half. Since for these examples Metropolis-Hastings is rapidly mixing on each 
half of the space, their bounds show rapid mixing of parallel and simulated 
tempering. This is in contrast to the standard (untempered) Metropolis- 
Hastings chain, which is torpidly mixing. Here "torpid mixing" means that 
the spectral gap decreases exponentially as a function of the problem size, 
while "rapid mixing" means that it decreases polynomially. The torpid/rapid 
mixing distinction is a measure of the computational efficiency of the algo- 
rithm. 

The results of [16] are extended by Bhatnagar and Randall [1] to show the 
rapid mixing of parallel and simulated tempering on an asymmetric version 
of the exponential valley density and the rapid mixing of a variant of parallel 
tempering on the mean-field Ising model with external field. These authors 
also show that parallel and simulated tempering are torpidly mixing on the 
mean-field Potts model with q = 3, regardless of the number and choice of 
temperatures. 

We generalize the decomposition approach of [16] and [1] to obtain lower
bounds on the spectral gaps of parallel and simulated tempering for any 
target distribution, defined on any state space, and any choice of tempera- 
tures (Theorem 3.1 and Corollary 3.1). Conceptually, we partition the state 
space into subsets on which the target density is unimodal. Then we bound 
the spectral gap of parallel and simulated tempering in terms of the mix- 
ing within each unimodal subset and the mixing among the subsets. Since 
Metropolis-Hastings for a unimodal distribution is often rapidly mixing, 
these bounds can be tighter than the simulated tempering bound of [14].

Our bounds imply a set of conditions under which parallel and simulated 
tempering chains are guaranteed to be rapidly mixing. The first is that 
Metropolis-Hastings is rapidly mixing when restricted to any one of the 
unimodal subsets. The challenge is then to ensure that the tempering chain 
is able to cross between the modes efficiently. In order to guarantee rapid 
mixing of the tempering chain, the second condition is that the highest- 
temperature chain mixes rapidly among the unimodal subsets. The third is 
that the overlap between distributions at adjacent temperatures decreases 
no more than polynomially in the problem size, which is necessary in order 
to mix rapidly among the temperatures. In the case where the modes are
symmetric (as defined in Section 4.1), these conditions guarantee rapid mixing.
We give two examples where they hold: an equally weighted mixture of
normal distributions in $\mathbb{R}^M$ with identity covariance matrices (Section 4.1.2),
and the mean-field Ising model (Section 4.1.1). Mixtures of normal distributions
are of interest because they closely approximate many
multimodal distributions encountered in statistical practice.

When the modes are asymmetric, the three conditions above are not 
enough to guarantee rapid mixing, as implied by counterexamples given in 
the companion paper by Woodard, Schmidler and Huber [28]. In Section 4.2, 
we obtain an additional (fourth) condition that guarantees rapid mixing in 
the general case, and use this to show rapid mixing of parallel and simulated 
tempering for a mixture of normal distributions with unequal weights. 

2. Preliminaries. Consider a measure space $(\mathcal{X}, \mathcal{F}, \lambda)$. Often $\mathcal{X}$ is countable
and $\lambda$ is counting measure, or $\mathcal{X} = \mathbb{R}^d$ and $\lambda$ is Lebesgue measure, but
more general spaces are possible. When we refer to a subset $A \subset \mathcal{X}$, we will
implicitly assume that it is measurable with respect to $\lambda$. In order to draw
samples from a distribution $\mu$ on $(\mathcal{X}, \mathcal{F})$, one may simulate a Markov chain
that has $\mu$ as its limiting distribution, as we now describe. Let $P$ be a transition
kernel on $\mathcal{X}$, defined as in [26], which operates on distributions on the
left, so that for any distribution $\mu$:

$$(\mu P)(A) = \int \mu(dx)\, P(x, A) \qquad \forall A \subset \mathcal{X}.$$



If $\mu P = \mu$, then call $\mu$ a stationary distribution of $P$. One way of finding a
transition kernel with stationary distribution $\mu$ is by constructing it to be
reversible with respect to $\mu$, as we now describe.

$P$ operates on real-valued functions $f$ on the right, so that for any such
$f$,

$$(Pf)(x) = \int f(y)\, P(x, dy) \qquad \forall x \in \mathcal{X}.$$

Define the inner product $(f, g)_\mu = \int f(x) g(x)\, \mu(dx)$ and denote by $L_2(\mu)$ the
set of functions $f$ such that $(f, f)_\mu < \infty$. $P$ is called reversible with respect
to $\mu$ if $(f, Pg)_\mu = (Pf, g)_\mu$ for all $f, g \in L_2(\mu)$ and nonnegative definite if
$(Pf, f)_\mu \ge 0$ for all $f \in L_2(\mu)$. If $P$ is reversible with respect to $\mu$, then
$\mu$ is easily seen to be a stationary distribution of $P$. We will primarily be
interested in the case where $\mu$ has a density $\pi$ with respect to $\lambda$, and we
define $\pi[A] = \mu(A)$ and define $(f, g)_\pi$, $L_2(\pi)$, and $\pi$-reversibility to be the
same as for $\mu$.

If $P$ is $\phi$-irreducible and aperiodic (defined as in [22]), nonnegative definite
and $\mu$-reversible, then the Markov chain with transition kernel $P$ converges
in distribution to $\mu$ at a rate bounded by the spectral gap:

$$\mathrm{Gap}(P) = \inf_{f \in L_2(\mu),\ \mathrm{Var}_\mu(f) > 0} \frac{\mathcal{E}(f, f)}{\mathrm{Var}_\mu(f)}, \tag{1}$$

where $\mathcal{E}(f, f)$ is the Dirichlet form $(f, (I - P)f)_\mu$, and $\mathrm{Var}_\mu(f)$ is the variance
$(f, f)_\mu - (f, 1)_\mu^2$. That is, for every distribution $\mu_0$ having a density
with respect to $\mu$, there is some $h(\mu_0) > 0$ such that $\forall n \in \mathbb{N}$,

$$\|\mu_0 P^n - \mu\|_{TV} = \sup_{A \subset \mathcal{X}} |\mu_0 P^n(A) - \mu(A)| \le h(\mu_0)\,[1 - \mathrm{Gap}(P)]^n \le h(\mu_0)\, e^{-n\,\mathrm{Gap}(P)}, \tag{2}$$

where $\|\cdot\|_{TV}$ is the total variation distance, and the first inequality comes
from functional analysis [15, 21, 23]. When $\mathrm{Gap}(P) > 0$, the chain is called
geometrically ergodic (see, e.g., [22, 23]) and $\mathrm{Gap}(P)$ provides a nontrivial
bound on the convergence rate. Under these conditions, we can obtain samples
from $\mu$ by simulating $P$ until it has converged arbitrarily close to $\mu$.
We now describe a common way of constructing a transition kernel that is
reversible with respect to a particular density of interest $\pi$.
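
To make the role of the spectral gap concrete, the following minimal sketch (an illustration added here, not taken from the paper) uses a two-state chain with hypothetical transition probabilities `p` and `q`. For such a chain the eigenvalues are 1 and 1 - p - q, so the gap equals p + q whenever p + q <= 1 (which also makes the chain nonnegative definite), and the total variation distance to stationarity shrinks by exactly the factor 1 - Gap(P) at every step.

```python
def step(dist, P):
    """One step of the chain: the row vector dist multiplied by the matrix P."""
    n = len(P)
    return [sum(dist[x] * P[x][y] for x in range(n)) for y in range(n)]


def tv_distance(d1, d2):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(d1, d2))


# Hypothetical two-state chain; p and q are illustrative values only.
p, q = 0.2, 0.1
P = [[1.0 - p, p], [q, 1.0 - q]]
mu = [q / (p + q), p / (p + q)]   # stationary distribution of P
gap = p + q                       # one minus the second eigenvalue 1 - p - q

dist = [1.0, 0.0]                 # start deterministically in state 0
tv_path = [tv_distance(dist, mu)]
for _ in range(20):
    dist = step(dist, P)
    tv_path.append(tv_distance(dist, mu))
```

Here `tv_path[n]` equals `tv_path[0] * (1 - gap) ** n` up to rounding, which is the geometric decay that the spectral gap controls in general.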

2.1. Metropolis-Hastings. Consider a transition kernel $P(w, dz)$ (the "proposal
kernel") having a density $p(w, \cdot)$ with respect to $\lambda$ for each $w$, so that
$P(w, dz) = p(w, z)\lambda(dz)$, and define the "Metropolis-Hastings transition kernel
for $P$ with respect to $\pi$" as follows. If the current state is $w$, propose a
move $z$ according to $P(w, \cdot)$, accept the move with probability

$$\rho(w, z) = \min\left\{1, \frac{\pi(z)\, p(z, w)}{\pi(w)\, p(w, z)}\right\}$$

and otherwise reject. Denote the resulting transition kernel by $P_{MH}$; it is
easily seen to be reversible with respect to $\pi$.
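
The construction can be sketched on a finite state space, where the kernel is just a matrix. The code below is an illustrative sketch rather than the authors' implementation: the five-state bimodal target `pi`, the nearest-neighbour `proposal` and the function names are all assumptions made for the example.

```python
def mh_accept_prob(pi_w, pi_z, p_wz, p_zw):
    """Acceptance probability min{1, pi(z) p(z, w) / (pi(w) p(w, z))}."""
    return min(1.0, (pi_z * p_zw) / (pi_w * p_wz))


def mh_transition_matrix(pi, proposal):
    """Metropolis-Hastings matrix on states 0..n-1.

    pi: positive (possibly unnormalized) target weights.
    proposal: row-stochastic proposal matrix as a list of lists.
    """
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for w in range(n):
        for z in range(n):
            if z != w and proposal[w][z] > 0.0:
                accept = mh_accept_prob(pi[w], pi[z], proposal[w][z], proposal[z][w])
                P[w][z] = proposal[w][z] * accept
        # Rejected proposals (and self-proposals) leave the chain at w.
        P[w][w] = 1.0 - sum(P[w])
    return P


# Hypothetical bimodal target with a low-probability state between the modes.
pi = [4.0, 1.0, 0.1, 1.0, 4.0]
n = len(pi)
proposal = [[0.0] * n for _ in range(n)]
for w in range(n):
    for z in (w - 1, w + 1):
        if 0 <= z < n:
            proposal[w][z] = 0.5
    proposal[w][w] = 1.0 - sum(proposal[w])

P_MH = mh_transition_matrix(pi, proposal)
```

Reversibility can be checked directly: `pi[w] * P_MH[w][z]` equals `pi[z] * P_MH[z][w]` for every pair of states, which is the detailed balance condition behind the claim above.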

2.2. Parallel and simulated tempering. If the Metropolis-Hastings proposal
kernel moves only locally in the space, and if $\pi$ has more than one
mode, then the chain $P_{MH}$ may move between the modes of $\pi$ infrequently.
Tempering is a modification of Metropolis-Hastings where the density of
interest $\pi$ is "flattened" in order to facilitate movement among the modes
of $\pi$.

The parallel tempering algorithm [7] simulates parallel Markov chains defined
on tempered densities $\pi_k(z) \propto \pi(z)^{\beta_k}$ for $z \in \mathcal{X}$, where $(\beta_0, \beta_1, \ldots, \beta_N)$
is a sequence of "inverse temperature" parameters chosen to satisfy $0 < \beta_0 <
\cdots < \beta_N = 1$ and $\int \pi(z)^{\beta_0} \lambda(dz) < \infty$. The choice of inverse temperatures is
flexible; a common choice is a geometric progression, and [19] provides an
asymptotic optimality result to support this. The chains occasionally "swap"
the states of adjacent temperature levels, resulting in a single Markov chain
with state $x = (x_{[0]}, \ldots, x_{[N]})$ on the space $\mathcal{X}_{pt} = \mathcal{X}^{N+1}$. Swaps are accepted
according to a Metropolis criterion which preserves the joint density

$$\pi_{pt}(x) = \prod_{k=0}^{N} \pi_k(x_{[k]}), \qquad x \in \mathcal{X}_{pt},$$

with product measure $\lambda_{pt}(dx) = \prod_{k=0}^{N} \lambda(dx_{[k]})$. It is easy to see that the
marginal density of $x_{[N]}$ under stationarity is $\pi$, the density of interest.

For concreteness, we consider the following specific interleaving of swap
and update moves, and we add to each move a 1/2 holding probability to
guarantee nonnegative definiteness. The update move chooses $k$ uniformly
on $\{0, \ldots, N\}$ and updates $x_{[k]}$ according to some $\pi_k$-reversible transition
kernel $T_k$, yielding a transition kernel $T$ on $\mathcal{X}_{pt}$:

$$T(x, dy) = \frac{1}{2(N+1)} \sum_{k=0}^{N} T_k(x_{[k]}, dy_{[k]})\, \delta_{x_{[-k]}}(y_{[-k]})\, dy_{[-k]}, \qquad x, y \in \mathcal{X}_{pt},$$

where $x_{[-k]} = (x_{[0]}, \ldots, x_{[k-1]}, x_{[k+1]}, \ldots, x_{[N]})$ and $\delta$ is Dirac's delta function.
Often $T_k$ is a Metropolis-Hastings kernel with respect to $\pi_k$.

The swap move $Q$ samples $k$ uniformly from $\{0, \ldots, N - 1\}$ and proposes
exchanging the value of $x_{[k]}$ with that of $x_{[k+1]}$. The proposed state, denoted
$(k, k+1)x$, is accepted according to the Metropolis criterion preserving $\pi_{pt}$:

$$\rho(x, (k, k+1)x) = \min\left\{1, \frac{\pi_k(x_{[k+1]})\, \pi_{k+1}(x_{[k]})}{\pi_k(x_{[k]})\, \pi_{k+1}(x_{[k+1]})}\right\},$$

so that for any $A \subset \mathcal{X}_{pt}$ and $x \in \mathcal{X}_{pt}$,

$$Q(x, A) = \frac{1}{2N} \sum_{k=0}^{N-1} 1_A((k, k+1)x)\, \rho(x, (k, k+1)x) + 1_A(x)\left[1 - \frac{1}{2N} \sum_{k=0}^{N-1} \rho(x, (k, k+1)x)\right],$$

where $1_A$ is the indicator function of the set $A$. Both $T$ and $Q$ are easily seen
to be reversible with respect to $\pi_{pt}$ by construction, and nonnegative definite
due to their 1/2 holding probability. We define the parallel tempering chain
$P_{pt} = QTQ$, which performs two swapping moves for each update move. It
is easily verified that $P_{pt}$ is nonnegative definite and reversible with respect
to $\pi_{pt}$, so the convergence of the parallel tempering chain to $\pi_{pt}$ may be
bounded using the spectral gap of $P_{pt}$.
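
For tempered densities the normalizing constants cancel in the swap acceptance ratio, so it can be evaluated from an unnormalized log-density alone. The sketch below illustrates this under stated assumptions: the one-dimensional bimodal `log_pi`, the three-level ladder `betas` and all function names are hypothetical choices for the example and do not appear in the paper.

```python
import math


def joint_log_density(log_pi, betas, x):
    """Unnormalized log of the joint density prod_k pi(x[k])^{beta_k}."""
    return sum(b * log_pi(xk) for b, xk in zip(betas, x))


def swap_accept_prob(log_pi, betas, x, k):
    """Probability of accepting the exchange of x[k] and x[k+1].

    The log acceptance ratio reduces to
    (beta_k - beta_{k+1}) * (log pi(x[k+1]) - log pi(x[k])).
    """
    log_ratio = (betas[k] - betas[k + 1]) * (log_pi(x[k + 1]) - log_pi(x[k]))
    return math.exp(min(0.0, log_ratio))


def swapped(x, k):
    """Return a copy of x with the states at levels k and k+1 exchanged."""
    y = list(x)
    y[k], y[k + 1] = y[k + 1], y[k]
    return y


def log_pi(z):
    """Hypothetical target: equal mixture of N(-3, 1) and N(3, 1), unnormalized."""
    return math.log(0.5 * math.exp(-0.5 * (z + 3.0) ** 2)
                    + 0.5 * math.exp(-0.5 * (z - 3.0) ** 2))


betas = [0.1, 0.4, 1.0]   # increasing inverse temperatures, beta_N = 1
x = [0.0, 2.5, -3.0]      # current state, one coordinate per level
accept_probs = [swap_accept_prob(log_pi, betas, x, k) for k in range(len(betas) - 1)]
```

A swap that moves the higher-density state to the colder level (larger inverse temperature) is always accepted, while the reverse swap is accepted with probability below one; this asymmetry is what preserves the joint density.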

Note that the definitions of $T$ and $Q$ do not rely on $\pi_k \propto \pi^{\beta_k}$, and indeed
we may specify distributions $\pi_k$ in any convenient way subject to $\pi_N = \pi$.
We refer to the resulting chain as a swapping chain, and denote the corresponding
state space, measure, transition kernel and associated stationary
density by $\mathcal{X}_{sc}$, $\lambda_{sc}$, $P_{sc}$ and $\pi_{sc}$, respectively. Although the terms "parallel
tempering chain" and "swapping chain" are used interchangeably in the
computer science literature, we follow the statistics and physics literature
in defining a parallel tempering chain as using tempered distributions, and
define a "swapping chain" as using arbitrary distributions.

The related technique of simulated tempering [8, 17] has state $(z, k) \in
\mathcal{X}_{st} = \mathcal{X} \otimes \{0, \ldots, N\}$ and stationary density

$$\pi_{st}(z, k) = \frac{1}{N+1}\, \pi_k(z), \qquad (z, k) \in \mathcal{X}_{st},$$

with two move types: the first ($T'$) updates $z \in \mathcal{X}$ according to $T_k$, conditional
on $k$, and the second ($Q'$) samples the level $k$ from its conditional
distribution given $z$. Once again, a holding probability of 1/2 is added to
both $T'$ and $Q'$. The transition kernel is then specified as $P_{st} = Q'T'Q'$. For
a lack of separate terms, we use "simulated tempering" to mean any such
chain $P_{st}$, regardless of whether or not the densities $\pi_k$ are tempered versions
of $\pi$.



3. Lower bounds on the spectral gaps of swapping and simulated tempering
chains. Two key results of the current paper are lower bounds on the
spectral gaps of swapping and simulated tempering chains. These bounds
imply the conditions for rapid mixing given in Section 4. The bounds are in
terms of several quantities. Informally, the first quantity measures how well
each chain mixes when restricted to each unimodal subset. The second
is how well the highest-temperature chain $T_0$ mixes among the subsets. The
third is the overlap of the distributions of adjacent levels, and the fourth
concerns the probability of each unimodal subset as a function of the (inverse)
temperature.

In order to bound the spectral gap of a swapping or simulated tempering 
chain in terms of the mixing of the chain within each subset and the mixing 
of the chain among the subsets, we use a state space decomposition result 
due to Caracciolo, Pelissetto and Sokal [3] and first published by Madras
and Randall [14]. As in [14], we will use the following definitions. 

For any transition kernel $P$ reversible with respect to a distribution $\mu$ and
any subset $A$ of the state space of $P$, define the restriction of $P$ to $A$ as

$$P|_A(x, B) = P(x, B) + 1_B(x)\, P(x, A^c) \qquad \text{for } x \in A,\ B \subset A. \tag{3}$$

Note that $P|_A$ is reversible with respect to $\mu|_A$, the restriction of $\mu$ to $A$.
Now take any partition $\mathcal{A} = \{A_j : j = 1, \ldots, J\}$ of the state space of $P$ such
that $\mu(A_j) > 0$ for all $j$, and define the projection matrix of $P$ with respect
to $\mathcal{A}$ as

$$\bar{P}(i, j) = \frac{1}{\mu(A_i)} \int_{A_i} P(x, A_j)\, \mu(dx), \qquad i, j \in \{1, \ldots, J\}. \tag{4}$$

Note that $\bar{P}$ is reversible with respect to the distribution on $j \in \{1, \ldots, J\}$
taking value $\mu(A_j)$, and irreducible if $P$ is.
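
On a finite state space both objects are small matrix computations. The sketch below is an illustration under assumptions: it takes the projection to be the usual one from [14] (the stationary probability of moving into block j in one step, conditional on currently being in block i), and the four-state lazy random walk and the function names are hypothetical.

```python
def restrict(P, A):
    """Restriction of P to the block A: moves leaving A are replaced by holding."""
    idx = list(A)
    R = [[P[x][y] for y in idx] for x in idx]
    for r, x in enumerate(idx):
        R[r][r] += 1.0 - sum(P[x][y] for y in idx)   # mass that would have left A
    return R


def projection_matrix(P, mu, partition):
    """Projection of P onto a partition, given stationary probabilities mu."""
    J = len(partition)
    P_bar = [[0.0] * J for _ in range(J)]
    for i, A_i in enumerate(partition):
        mass_i = sum(mu[x] for x in A_i)
        for j, A_j in enumerate(partition):
            flow = sum(mu[x] * P[x][y] for x in A_i for y in A_j)
            P_bar[i][j] = flow / mass_i
    return P_bar


# Hypothetical example: lazy nearest-neighbour walk on the path 0-1-2-3.
n = 4
P = [[0.0] * n for _ in range(n)]
for x in range(n):
    for y in (x - 1, x + 1):
        if 0 <= y < n:
            P[x][y] = 0.25
    P[x][x] = 1.0 - sum(P[x])
mu = [0.25] * n                     # P is symmetric, so uniform is stationary
partition = [[0, 1], [2, 3]]

P_bar = projection_matrix(P, mu, partition)
P_A1 = restrict(P, partition[0])
```

In this example the projection is the two-state matrix with off-diagonal entries 0.125, so its spectral gap is 0.25, while each restricted chain is a two-state chain with off-diagonal entries 0.25.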

Now consider a swapping or simulated tempering chain defined as in Section 2.2
for some density of interest $\pi$ on a measure space $(\mathcal{X}, \mathcal{F}, \lambda)$, with
$\pi_k$-reversible transition kernels $T_k$. Let $\mathcal{A}$ be any partition of $\mathcal{X}$ such that
$\pi_k[A_j] > 0$ for all $k$ and $j$. The first quantity in our bound is the minimum
over $k$ and $j$ of $\mathrm{Gap}(T_k|_{A_j})$, which measures how well each chain mixes
within each partition element. The partition would typically be chosen so
that this quantity is large; in our examples, we choose $\mathcal{A}$ so that $\pi|_{A_j}$ is
unimodal (contains a single local mode) for each $j$.

Next, we consider how well the chain $T_0$ mixes among the partition elements.
Let $\bar{T}_0$ be the projection matrix of $T_0$ with respect to $\mathcal{A}$; the second
quantity in our bound is $\mathrm{Gap}(\bar{T}_0)$. Since $\mathcal{A}$ is finite, this is one minus the
second-largest eigenvalue of $\bar{T}_0$.

The third quantity is the overlap of $\{\pi_k : k = 0, \ldots, N\}$ with respect to $\mathcal{A}$,
defined as

$$\delta(\mathcal{A}) = \min_{\substack{|k - l| = 1 \\ j \in \{1, \ldots, J\}}} \left[\int_{A_j} \min\{\pi_k(z), \pi_l(z)\}\, \lambda(dz)\right] \Big/ \pi_k[A_j]. \tag{5}$$

The quantity $\delta(\mathcal{A})$ controls the rate of temperature changes in simulated
tempering. For the swapping chain, note that for any $i, j \in \{1, \ldots, J\}$ and
any $k \in \{0, \ldots, N - 1\}$, the marginal probability at stationarity of accepting
a proposed swap between $x_{[k]} \in A_i$ and $x_{[k+1]} \in A_j$ is

$$\frac{\int_{z \in A_j} \int_{w \in A_i} \min\{\pi_k(z)\pi_{k+1}(w),\ \pi_k(w)\pi_{k+1}(z)\}\, \lambda(dw)\, \lambda(dz)}{\pi_k[A_i]\, \pi_{k+1}[A_j]} \ge \delta(\mathcal{A})^2. \tag{6}$$

We will show that our overlap quantity $\delta(\mathcal{A})$ is bounded above by the overlap
used in Madras and Randall [14] and Zheng [29], and that our definition is
equal to theirs in the case of $\pi$ symmetric (as defined in Section 4.1).
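
For distributions on a finite set the overlap can be computed by direct enumeration. The following is a minimal sketch, assuming the overlap is the minimum over adjacent levels k, l and blocks j of the shared mass on block j divided by the level-k mass of that block; the two-level example and the function name are hypothetical.

```python
def overlap(densities, partition):
    """Overlap of adjacent levels with respect to a partition.

    densities: densities[k][z] is the normalized probability of state z at level k.
    partition: list of blocks, each a list of state indices.
    """
    N = len(densities) - 1
    best = float("inf")
    for k in range(N + 1):
        for l in (k - 1, k + 1):          # levels adjacent to k
            if 0 <= l <= N:
                for A_j in partition:
                    shared = sum(min(densities[k][z], densities[l][z]) for z in A_j)
                    mass = sum(densities[k][z] for z in A_j)
                    best = min(best, shared / mass)
    return best


# Hypothetical two-level example on four states.
pi_0 = [0.25, 0.25, 0.25, 0.25]           # flat "hot" level
pi_1 = [0.40, 0.10, 0.10, 0.40]           # bimodal "cold" level
partition = [[0, 1], [2, 3]]
delta = overlap([pi_0, pi_1], partition)
```

In this example each block carries mass 0.5 at both levels and shares mass 0.35 between them, so the overlap is 0.7.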

The fourth and final quantity concerns the probability of a single partition
element under $\pi_k$, as a function of $k$, for each partition element:

$$\gamma(\mathcal{A}) = \min_{j \in \{1, \ldots, J\}} \prod_{k=1}^{N} \min\left\{1, \frac{\pi_{k-1}[A_j]}{\pi_k[A_j]}\right\}. \tag{7}$$

Note that for any $j \in \{1, \ldots, J\}$ and any $k, l \in \{0, \ldots, N\}$ such that $k < l$,
$\pi_k[A_j] \ge \gamma(\mathcal{A})\, \pi_l[A_j]$. If $\pi_k[A_j]$ is a monotonic function of $k$ for each $j$ (which
need not hold for tempered distributions), then $\gamma(\mathcal{A})$ simplifies to

$$\min_{j \in \{1, \ldots, J\}} \frac{\pi_0[A_j]}{\pi_N[A_j]}.$$
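
The fourth quantity depends only on the table of block probabilities at each level. The sketch below assumes it is the minimum over blocks of the product, over adjacent levels, of min{1, (block probability at level k-1) / (block probability at level k)}; this form is an assumption chosen to be consistent with the two properties stated above, and the function name and example tables are hypothetical.

```python
def gamma_quantity(block_probs):
    """block_probs[k][j] is the probability of block j at level k (k = 0..N)."""
    N = len(block_probs) - 1
    J = len(block_probs[0])
    best = float("inf")
    for j in range(J):
        prod = 1.0
        for k in range(1, N + 1):
            prod *= min(1.0, block_probs[k - 1][j] / block_probs[k][j])
        best = min(best, prod)
    return best


# Monotone case: block 0 gains mass as the level index increases.
monotone = [[0.5, 0.5], [0.6, 0.4], [0.8, 0.2]]
# Non-monotone case: block 0 first loses and then gains mass.
non_monotone = [[0.5, 0.5], [0.3, 0.7], [0.6, 0.4]]

gamma_monotone = gamma_quantity(monotone)
gamma_non_monotone = gamma_quantity(non_monotone)
```

In the monotone table the product telescopes to 0.5 / 0.8 = 0.625, matching the simplified expression; in the non-monotone table the value is 0.5, which is smaller than the ratio of the first to the last level.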

With these definitions, the following theorem bounds the spectral gap of the 
swapping chain. 

Theorem 3.1. Given any partition $\mathcal{A} = \{A_j : j = 1, \ldots, J\}$ of $\mathcal{X}$ such
that $\pi_k[A_j] > 0$ for all $k$ and $j$, and given $\delta(\mathcal{A})$ as in (5) and $\gamma(\mathcal{A})$ as in (7),

Gap(Psc) > ( ^i^r+yia ) Gap(fo)mmGap(T,U^.). 

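In particular, the bound holds for parallel tempering with $\pi_k \propto \pi^{\beta_k}$.
Theorem 3.1 will be proven in Section 6. Note that

$$\begin{aligned}
\delta(\mathcal{A}) &= \min_{\substack{k \in \{0, \ldots, N-1\} \\ j \in \{1, \ldots, J\}}} \frac{\int_{A_j} \min\{\pi_k(z), \pi_{k+1}(z)\}\, \lambda(dz)}{\max\{\pi_k[A_j], \pi_{k+1}[A_j]\}} \\
&\le \min_{k \in \{0, \ldots, N-1\}} \frac{\sum_j \int_{A_j} \min\{\pi_k(z), \pi_{k+1}(z)\}\, \lambda(dz)}{\max\{\sum_j \pi_k[A_j],\ \sum_j \pi_{k+1}[A_j]\}} \\
&= \min_{k \in \{0, \ldots, N-1\}} \int \min\{\pi_k(z), \pi_{k+1}(z)\}\, \lambda(dz).
\end{aligned} \tag{8}$$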

The final expression in (8) is the definition of overlap that is used in [14]
and [29]. Therefore, Theorem 3 of [29], along with our Theorem 3.1, implies
the following bound for simulated tempering:

Corollary 3.1. Let $P_{st}$ be the simulated tempering chain defined with
the same $N$ and same set of densities $\pi_k$ as the swapping chain. Then

Gap(Pst) > ( ^M^r+ipjs ) Gap(ro)mmGap(T,U^.). 

4. Examples of rapid mixing. We will show rapid mixing of parallel and
simulated tempering for several examples by applying Theorem 3.1 and
Corollary 3.1. We are particularly interested in cases where Metropolis-Hastings
with local proposals is torpidly mixing due to the multimodality
of the target density $\pi$. To show rapid mixing of tempering for a case where
the number of modes $J$ is fixed, we choose a set of temperatures the number
of which grows at most polynomially in the problem size. We then show
that each chain is rapidly mixing when restricted to each unimodal subset
[meaning that $\min_{j,k} \mathrm{Gap}(T_k|_{A_j})$ decreases at most polynomially in the
problem size], and that the highest-temperature chain mixes rapidly among
the subsets. Additionally, we show that the overlap of the parallel or simulated
tempering chain decreases at most polynomially in the problem size,
which is needed in order to mix rapidly among the temperatures. In the case
where $\pi$ is symmetric with respect to $\mathcal{A}$ (to be defined), we will see that
$\gamma(\mathcal{A}) = 1$, so the above conditions imply that parallel tempering is rapidly
mixing (by Theorem 3.1). The same conditions imply the rapid mixing of
simulated tempering (by Corollary 3.1) using the same tempered densities.

The above conditions are necessary as well as sufficient for rapid mixing 
in the symmetric case. Simple examples can be constructed to show the 
necessity of each condition; Woodard, Schmidler and Huber [28] give an 
example of a symmetric distribution for which the first two conditions hold, 
but the condition on the overlap fails, and parallel and simulated tempering 
are torpidly mixing. 

In the case of general (not necessarily symmetric) $\pi$, the above conditions
are insufficient to guarantee rapid mixing; counterexamples are given in [28].
In the general case, we must additionally show that $\gamma(\mathcal{A})$ decreases at most
polynomially in the problem size.

4.1. Examples of rapid mixing on symmetric distributions. Recall that
the target density $\pi$ is defined on a state space $\mathcal{X}$ with measure $\lambda$. Define
$\pi$ to be symmetric with respect to a partition $\{A_j : j = 1, \ldots, J\}$ of $\mathcal{X}$ if for
every pair of partition elements $A_i$, $A_j$ there is some $\lambda$-measure-preserving
bijection $f_{ij}$ from $A_i$ to $A_j$ that preserves $\pi$. Note that when $\pi$ is symmetric
with respect to $\{A_j : j = 1, \ldots, J\}$, the inequality in (8) is an equality.
Additionally, $\pi_k[A_j] = 1/J$ for all $k \in \{0, \ldots, N\}$ and $j \in \{1, \ldots, J\}$, so $\gamma(\mathcal{A}) = 1$.
We will give examples of symmetric $\pi$ for which parallel and simulated tempering
are rapidly mixing.

4.1.1. The mean field Ising model. For each $M \in \mathbb{N}$, the mean field Ising
model is defined for $z \in \mathcal{X} = \{-1, +1\}^M$ as:

$$\pi(z) = \frac{1}{Z} \exp\left\{\frac{\alpha}{2M}\left(\sum_{i=1}^{M} z_i\right)^2\right\}, \tag{9}$$

where $Z = \sum_z \exp\{\alpha(\sum_i z_i)^2 / (2M)\}$. The single-site proposal kernel $S$ chooses
$i \in \{1, \ldots, M\}$ uniformly at random and proposes switching the sign of $z_i$.
Metropolis-Hastings for $S$ with respect to $\pi$ is torpidly mixing for $\alpha > 1$, as
is straightforward to show using conductance (as defined in [12, 25]).

Taking $N = M$, $\beta_k = k/N$ and $T_k$ equal to Metropolis-Hastings for $S$ with
respect to $\pi_k \propto \pi^{\beta_k}$, it is shown by Madras and Zheng [16] that parallel and
simulated tempering are rapidly mixing. We will show that this is also a
consequence of our Theorem 3.1 and Corollary 3.1.

As in [16], partition $\mathcal{X}$ into $A_1 = \{z \in \mathcal{X} : \sum_i z_i < 0\}$ and $A_2 = \{z \in \mathcal{X} :
\sum_i z_i > 0\}$. Restricting to $M$ odd, the density $\pi$ is clearly symmetric with
respect to the partition.

Using the fact that $\pi_0$ is uniform, it is straightforward to show that $T_0$ is
rapidly mixing. Since $T_0$ is rapidly mixing, so is $\bar{T}_0$ (see Theorem 5.2).
Additionally, it is shown in [16] that the minimum over $k$ and $j$ of $\mathrm{Gap}(T_k|_{A_j})$
is polynomially decreasing in $M$.

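Note that for any $z \in \mathcal{X}$ and any $k \in \{0, \ldots, N - 1\}$, since $\beta_{k+1} - \beta_k = 1/M$
and $(\sum_i z_i)^2 \le M^2$,

$$1 \le \exp\left\{\frac{\alpha(\beta_{k+1} - \beta_k)}{2M}\left(\sum_i z_i\right)^2\right\} \le \exp\{\alpha/2\}.$$

The ratio of the normalizing constants of $\pi_{k+1}$ and $\pi_k$ lies in the same interval.
Therefore,

$$\frac{\pi_k(z)}{\pi_{k+1}(z)} \in [\exp\{-\alpha/2\},\ \exp\{\alpha/2\}],$$

which implies that

$$\sum_{z \in \mathcal{X}} \min\{\pi_k(z), \pi_{k+1}(z)\} = \sum_{z \in \mathcal{X}} \pi_k(z) \min\left\{1, \frac{\pi_{k+1}(z)}{\pi_k(z)}\right\} \ge \exp\{-\alpha/2\}.$$

Recalling that for $\pi$ symmetric, the inequality in (8) is an equality, $\delta(\mathcal{A})$
is bounded below by a constant for all $M$. Therefore, by Theorem 3.1 and
Corollary 3.1, the parallel or simulated tempering chain is rapidly mixing.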
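
The overlap bound can be checked numerically for moderate M, because the mean field Ising density depends on a configuration only through its number of +1 spins. The sketch below is an illustration under assumptions: the interaction parameter is written `alpha`, the ladder is the linear one with inverse temperatures k/M for k = 0, ..., M, and the function names are hypothetical.

```python
import math


def ising_level_weights(M, alpha, beta):
    """Per-configuration weights indexed by the number j of +1 spins, and the normalizer.

    A configuration with j spins equal to +1 has spin sum 2j - M.
    """
    w = [math.exp(beta * alpha * (2 * j - M) ** 2 / (2.0 * M)) for j in range(M + 1)]
    Z = sum(math.comb(M, j) * w[j] for j in range(M + 1))
    return w, Z


def ising_min_overlap(M, alpha):
    """Minimum over adjacent levels of the summed pointwise minimum of the two densities."""
    levels = [ising_level_weights(M, alpha, k / M) for k in range(M + 1)]
    worst = float("inf")
    for k in range(M):
        (w0, Z0), (w1, Z1) = levels[k], levels[k + 1]
        total = sum(math.comb(M, j) * min(w0[j] / Z0, w1[j] / Z1) for j in range(M + 1))
        worst = min(worst, total)
    return worst


M, alpha = 21, 2.0                     # M odd, alpha > 1 (the bimodal regime)
min_overlap = ising_min_overlap(M, alpha)
lower_bound = math.exp(-alpha / 2.0)   # the constant bound derived in the text
```

For odd M the two halves of the state space have probability 1/2 at every level, so this summed minimum is also the overlap with respect to the two-block partition.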

4.1.2. A symmetric normal mixture. Many multimodal distributions arising
in statistics are well approximated by mixtures of normal distributions.
We will analyze the mixing of parallel and simulated tempering on several
two-component mixtures of normal distributions in $\mathbb{R}^M$. For any length-$M$
vector $\nu$, $M \times M$ covariance matrix $\Sigma$, and $z \in \mathbb{R}^M$, let $N_M(z; \nu, \Sigma)$ be the
density of a multivariate normal distribution in $\mathbb{R}^M$ with mean $\nu$ and covariance
$\Sigma$, evaluated at $z$. Let $1_M$ denote the vector of $M$ ones, and $I_M$ denote
the $M \times M$ identity matrix. Take any $b > 0$ and any sequence $a_1, a_2, \ldots$ such
that $a_M \in (0, 1)$ for each $M$, and consider the following weighted mixture of
two normal densities in $\mathbb{R}^M$:

$$\pi(z) = a_M N_M(z; -b1_M, I_M) + (1 - a_M) N_M(z; b1_M, I_M). \tag{10}$$

Let S be the proposal kernel that is uniform on the ball of radius 
centered at the current state. Partition $\mathcal{X}$ into $A_1 = \{z : \sum_i z_i < 0\}$ and $A_2 =
\{z : \sum_i z_i \ge 0\}$. For technical reasons, we will use the following approximation
to $\pi$:

$$\tilde{\pi}(z) \propto a_M N_M(z; -b1_M, I_M)\, 1_{A_1}(z) + (1 - a_M) N_M(z; b1_M, I_M)\, 1_{A_2}(z), \tag{11}$$

which truncates the overlapping portions of the tails. This simplification
does not alter the mixing properties of Metropolis-Hastings or tempering,
with the exception of the case where either $a_M$ or $1 - a_M$ approaches zero
very quickly so that $\pi$ is unimodal for large $M$ while $\tilde{\pi}$ is bimodal. However,
we are interested only in the situation where $\pi$ is bimodal, causing poor
mixing of Metropolis-Hastings and suggesting the more efficient application
of tempering; for this purpose $\tilde{\pi}$ is equivalent to $\pi$.
Metropolis-Hastings for $S$ with respect to the density

$$\tilde{\pi}|_{A_1}(z) \propto N_M(z; -b1_M, I_M)\, 1_{A_1}(z)$$

or with respect to

$$\tilde{\pi}|_{A_2}(z) \propto N_M(z; b1_M, I_M)\, 1_{A_2}(z)$$

is rapidly mixing in $M$, as implied by results in Kannan and Li [11] (details
are given in [27]). However, we will show that Metropolis-Hastings for $S$
with respect to $\tilde{\pi}$ is torpidly mixing. Then we will specify a set of inverse
temperatures which yield rapidly mixing parallel and simulated tempering
chains for $\tilde{\pi}$.

Consider Metropolis-Hastings for $S$ with respect to $\tilde{\pi}$. Note that the
boundary of $A_1$ with respect to the Metropolis-Hastings kernel is the set
of $z \in A_1$ within the proposal radius of the set $A_2$. It is straightforward to show
that the probability of this boundary under $\tilde{\pi}|_{A_1}$ decreases exponentially
as a function of $M$. Similarly, the probability of the boundary of $A_2$ under
$\tilde{\pi}|_{A_2}$ decreases exponentially in $M$. Therefore, the conductance of the set
$A_1$ (where conductance is defined as in [12, 25]) is exponentially decreasing,
which implies that Metropolis-Hastings is torpidly mixing.

For any $\beta$, define the tempered density $\tilde{\pi}_\beta \propto \tilde{\pi}^\beta$. Note that for any $\beta$,

$$\tilde{\pi}_\beta(z) \propto a_M^\beta N_M(z; -b1_M, \beta^{-1} I_M)\, 1_{A_1}(z) + (1 - a_M)^\beta N_M(z; b1_M, \beta^{-1} I_M)\, 1_{A_2}(z).$$

The normalizing constant is $[a_M^\beta + (1 - a_M)^\beta]\, \Phi(b\sqrt{M}\beta^{1/2})$, where $\Phi$ is the
cumulative normal distribution function in one dimension.

Metropolis-Hastings for $S$ with respect to $\tilde{\pi}_{M^{-1}}$ mixes rapidly between
$A_1$ and $A_2$, shown as follows. The probability of the boundary of $A_1$ under
$\tilde{\pi}_{M^{-1}}|_{A_1}$ is

$$[\Phi(b) - \Phi(b(1 - M^{-1}))] / \Phi(b), \tag{12}$$

which is polynomially decreasing in $M$. The probability of the boundary
of $A_2$ under $\tilde{\pi}_{M^{-1}}|_{A_2}$ is also equal to (12). It can also be shown that the
marginal probability of accepting proposed moves between $A_1$ and $A_2$ is
polynomially decreasing in $M$, proving rapid mixing between $A_1$ and $A_2$;
the details are given in [27].

The infimum over $\beta > 0$ of the spectral gap of Metropolis-Hastings
for $S$ with respect to $\tilde{\pi}_\beta|_{A_j}$ decreases at most polynomially in $M$ for $j = 1, 2$.
This is because $\tilde{\pi}_\beta|_{A_j}$ is a normal density restricted to a convex set; a bound
on the spectral gap of Metropolis-Hastings for $S$ with respect to such a
density is given in [11], and this bound is polynomially decreasing in $M$.
More details are given in [27].

When $a_M = 1 - a_M$, $\tilde{\pi}$ is symmetric with respect to the partition $\{A_1, A_2\}$.
In this case, set $N = M$ and $\beta_k = M^{-(1 - k/N)}$ (a geometric progression),
and let $T_k$ be the Metropolis-Hastings kernel for $S$ with respect to $\pi_k \propto
\tilde{\pi}^{\beta_k}$. With these specifications, parallel and simulated tempering are rapidly
mixing, as we will show. We have already seen that $\min_{j,k} \mathrm{Gap}(T_k|_{A_j})$ is
polynomially decreasing in $M$ and that $\bar{T}_0$ is rapidly mixing. Next, we will
show that $\delta(\mathcal{A})$ is also polynomially decreasing in $M$.

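Let $\lambda$ be Lebesgue measure in $\mathbb{R}^M$, and take any $k \in \{0, \ldots, N - 1\}$. Noting
that $\beta_k/\beta_{k+1} = M^{-1/M}$, we have

$$\begin{aligned}
\int \min\{\pi_k(z), \pi_{k+1}(z)\}\, \lambda(dz)
&= 2\int_{A_2} \min\{\pi_k(z), \pi_{k+1}(z)\}\, \lambda(dz) \\
&= \int_{A_2} \min\left\{\frac{N_M(z; b1_M, \beta_k^{-1} I_M)}{\Phi(b\sqrt{M}\beta_k^{1/2})},\ \frac{N_M(z; b1_M, \beta_{k+1}^{-1} I_M)}{\Phi(b\sqrt{M}\beta_{k+1}^{1/2})}\right\} \lambda(dz) \\
&\ge \int_{A_2} \min\{N_M(z; b1_M, \beta_k^{-1} I_M),\ N_M(z; b1_M, \beta_{k+1}^{-1} I_M)\}\, \lambda(dz) \\
&= (2\pi)^{-M/2} \int_{A_2} \min\left\{\beta_k^{M/2} e^{-\beta_k \|z - b1_M\|^2/2},\ \beta_{k+1}^{M/2} e^{-\beta_{k+1} \|z - b1_M\|^2/2}\right\} \lambda(dz) \\
&\ge \left(\frac{\beta_k}{\beta_{k+1}}\right)^{M/2} \int_{A_2} N_M(z; b1_M, \beta_{k+1}^{-1} I_M)\, \lambda(dz) \\
&= M^{-1/2}\, \Phi(b\sqrt{M}\beta_{k+1}^{1/2}) \ \ge\ \frac{1}{2\sqrt{M}}.
\end{aligned} \tag{13}$$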

Therefore, $\delta(\mathcal{A})$ decreases at most polynomially in $M$. By Theorem 3.1 and
Corollary 3.1, parallel and simulated tempering with this $N$ and this set
of inverse temperatures are rapidly mixing in the case where $a_M = 1 - a_M$.
This is shown in [27] for $\pi$ as well as $\tilde{\pi}$. In the case where there are some $M$
with $a_M \ne 1 - a_M$, we must also verify that $\gamma(\mathcal{A})$ is polynomially decreasing
in $M$; this is done in the next section.

4.2. Examples of rapid mixing on general distributions. We now consider
the case of general (not necessarily symmetric) $\pi$.



4.2.1. A weighted normal mixture. Recall $\tilde{\pi}$ from (11), and assume without
loss of generality that $a_M \ge 1/2$. For technical reasons, we restrict
$a_M/(1 - a_M)$ to be exponentially bounded above, meaning that there exists
a constant $c$ such that $a_M/(1 - a_M) \le c^M$ for all $M$.

Consider the inverse temperature specification that we used for the symmetric
case, with $N = M$ and the set of inverse temperatures $\{M^{-(1 - k/M)} : k =
0, \ldots, M\}$. Also recall the inverse temperature specification for the mean field
Ising model, with $N = M$ and the set of inverse temperatures $\{k/M : k =
0, \ldots, M\}$. For the mixture of normals with unequal weights, we will need
both: take the set of inverse temperatures $\{M^{-(1 - k/M)} : k = 0, \ldots, M\} \cup
\{k/M : k = 1, \ldots, M\}$, so $N = 2M$.
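
The point of combining the two ladders is that the union inherits a useful property from each. The sketch below is an illustration under assumptions (the geometric ladder is taken to be M^{-(1 - k/M)} for k = 0, ..., M, and the function names are hypothetical): it builds the sorted union and exposes the smallest ratio and largest difference between adjacent inverse temperatures.

```python
def geometric_ladder(M):
    """Inverse temperatures M^{-(1 - k/M)}, k = 0..M, running from 1/M up to 1."""
    return [M ** (-(1.0 - k / M)) for k in range(M + 1)]


def linear_ladder(M):
    """Inverse temperatures k/M, k = 1..M."""
    return [k / M for k in range(1, M + 1)]


def combined_ladder(M, tol=1e-12):
    """Sorted union of both ladders, merging values that coincide up to tol."""
    merged = []
    for b in sorted(geometric_ladder(M) + linear_ladder(M)):
        if not merged or b - merged[-1] > tol:
            merged.append(b)
    return merged


M = 10
betas = combined_ladder(M)
min_ratio = min(lo / hi for lo, hi in zip(betas, betas[1:]))
max_step = max(hi - lo for lo, hi in zip(betas, betas[1:]))
```

Because every adjacent pair in the union lies between two consecutive geometric points, `min_ratio` is at least M^{-1/M}; because it also lies between two consecutive linear points, `max_step` is at most 1/M. The first property keeps adjacent tempered normals close, and the second limits how fast the component weights can change from one level to the next.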

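The arguments from Section 4.1.2 show that $\min_{j,k} \mathrm{Gap}(T_k|_{A_j})$ is polynomially
decreasing in $M$ and that $\bar{T}_0$ is rapidly mixing. Next, we will show
that $\gamma(\mathcal{A})$ and $\delta(\mathcal{A})$ are also polynomially decreasing in $M$.

Note that for any $\beta$,

$$\tilde{\pi}_\beta[A_1] = \frac{a_M^\beta}{a_M^\beta + (1 - a_M)^\beta},$$

which is an increasing function of $\beta$ since $a_M \ge 1/2$. Therefore, $\pi_k[A_1]$ is an
increasing function of $k$ and $\pi_k[A_2]$ is a decreasing function of $k$, so

$$\gamma(\mathcal{A}) = \frac{\pi_0[A_1]}{\pi_N[A_1]} \ge \frac{1}{2\pi_N[A_1]} \ge \frac{1}{2},$$

which does not depend on $M$. Also note that for any $k \in \{0, \ldots, N - 1\}$,
since adjacent inverse temperatures differ by at most $1/M$,

$$\frac{\pi_{k+1}[A_2]}{\pi_k[A_2]} \ge \left(\frac{1 - a_M}{a_M}\right)^{\beta_{k+1} - \beta_k} \ge \left(\frac{1 - a_M}{a_M}\right)^{1/M} \ge c^{-1}.$$

Therefore,

$$\begin{aligned}
\frac{\int_{A_2} \min\{\pi_k(z), \pi_{k+1}(z)\}\, \lambda(dz)}{\max\{\pi_k[A_2], \pi_{k+1}[A_2]\}}
&= \frac{\int_{A_2} \min\left\{\pi_k[A_2] \frac{N_M(z; b1_M, \beta_k^{-1} I_M)}{\Phi(b\sqrt{M}\beta_k^{1/2})},\ \pi_{k+1}[A_2] \frac{N_M(z; b1_M, \beta_{k+1}^{-1} I_M)}{\Phi(b\sqrt{M}\beta_{k+1}^{1/2})}\right\} \lambda(dz)}{\max\{\pi_k[A_2], \pi_{k+1}[A_2]\}} \\
&\ge \frac{\pi_{k+1}[A_2]}{\pi_k[A_2]} \int_{A_2} \min\{N_M(z; b1_M, \beta_k^{-1} I_M),\ N_M(z; b1_M, \beta_{k+1}^{-1} I_M)\}\, \lambda(dz) \\
&\ge c^{-1} \int_{A_2} \min\{N_M(z; b1_M, \beta_k^{-1} I_M),\ N_M(z; b1_M, \beta_{k+1}^{-1} I_M)\}\, \lambda(dz) \\
&\ge \frac{1}{2c\sqrt{M}},
\end{aligned}$$

where the last inequality is from (13), using the fact that $\beta_k/\beta_{k+1} \ge M^{-1/M}$.
This result, repeated for $A_1$, shows that $\delta(\mathcal{A})$ decreases at most polynomially
in $M$. By Theorem 3.1 and Corollary 3.1, parallel and simulated tempering with
this $N$ and this set of inverse temperatures are rapidly mixing on the weighted
mixture of normals $\tilde{\pi}$.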
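
The claim that the fourth quantity stays at or above 1/2 for the weighted mixture can be checked numerically. The sketch below is an illustration under assumptions: it uses the block probability a^beta / (a^beta + (1 - a)^beta) implied by the normalizing constant of the tempered truncated mixture, takes the fourth quantity to be the minimum over the two blocks of the product over adjacent levels of min{1, ratio of block probabilities}, and uses a hypothetical linear ladder and function names.

```python
def block_prob_A1(a, beta):
    """Mass of the first block under the tempered truncated mixture with weight a."""
    return a ** beta / (a ** beta + (1.0 - a) ** beta)


def gamma_two_block(a, betas):
    """Fourth quantity for the two-block partition along an increasing ladder betas."""
    p1 = [block_prob_A1(a, b) for b in betas]
    p2 = [1.0 - p for p in p1]
    result = float("inf")
    for probs in (p1, p2):
        prod = 1.0
        for prev, cur in zip(probs, probs[1:]):
            prod *= min(1.0, prev / cur)
        result = min(result, prod)
    return result


M = 10
betas = [k / M for k in range(1, M + 1)]   # hypothetical ladder from 1/M to 1
weights = [0.5, 0.7, 0.9, 0.999]           # mixture weights a >= 1/2
gammas = [gamma_two_block(a, betas) for a in weights]
```

For each weight the result equals the ratio of the first-block probability at the hottest level to that at the coldest level, and it never drops below 1/2, in line with the monotonicity argument above.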



5. Tools for bounding spectral gaps. In Sections 5.1 and 5.2, we give 
some results from the literature and slight extensions thereon. These results 
will be used in Section 6 for the proof of Theorem 3.1. 

5.1. A bound for finite state space Markov chains. We first consider
a method for finite state space Markov chains. Let $P$ and $Q$ be Markov
chain transition matrices on state space $\mathcal{X}$ with $|\mathcal{X}| < \infty$, reversible with
respect to densities $\pi_P$ and $\pi_Q$, respectively. Denote by $\mathcal{E}_P$ and $\mathcal{E}_Q$ the
Dirichlet forms of $P$ and $Q$, and let $E_P = \{(x, y) : \pi_P(x) P(x, y) > 0\}$ and
$E_Q = \{(x, y) : \pi_Q(x) Q(x, y) > 0\}$ be the edge sets of $P$ and $Q$, respectively.
For each pair $x \ne y$ such that $(x, y) \in E_Q$, fix a path $\gamma_{xy} = (x = x_0, x_1, x_2,
\ldots, x_k = y)$ of length $|\gamma_{xy}| = k$ such that $(x_i, x_{i+1}) \in E_P$ for $i \in \{0, \ldots, k - 1\}$.
Define

$$c = \max_{(z, w) \in E_P} \left\{\frac{1}{\pi_P(z) P(z, w)} \sum_{\gamma_{xy} \ni (z, w)} |\gamma_{xy}|\, \pi_Q(x) Q(x, y)\right\}.$$

Theorem 5.1 (Diaconis and Saloff-Coste [4]).

$$\mathcal{E}_Q \le c\, \mathcal{E}_P.$$

5.2. Bounds for general state space Markov chains. The following results 
hold for general state space transition kernels $P$ and $Q$, reversible with
respect to distributions $\mu_P$ and $\mu_Q$ on a space $\mathcal{X}$ with countably generated
$\sigma$-algebra.

Theorem 5.2. Let $\{A_j : j = 1, \ldots, J\}$ be any partition of $\mathcal{X}$ such that
$\mu_P(A_j) > 0$ for all $j$. Define $P|_{A_j}$ as in (3) and $\bar{P}$ as in (4). For $P$
nonnegative definite,

\[
\frac{1}{2}\, \mathrm{Gap}(\bar{P}) \min_{j = 1, \ldots, J} \mathrm{Gap}(P|_{A_j}) \leq \mathrm{Gap}(P) \leq \mathrm{Gap}(\bar{P}).
\]

The bounds are a direct consequence of results published in Madras and
Randall [14], as described in the Appendix of the current paper.
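
As a numerical illustration of Theorem 5.2, the following sketch checks both bounds on a small example. The four-state matrix `M`, the partition $A_1 = \{0,1\}$, $A_2 = \{2,3\}$ and all helper names are hypothetical choices made only for this illustration. The restriction and projection are written in the usual Madras and Randall form, which is assumed here to agree with definitions (3) and (4).

```python
import numpy as np

def spectral_gap(P):
    """Return 1 minus the second-largest eigenvalue of a reversible chain."""
    eigenvalues = np.sort(np.linalg.eigvals(P).real)[::-1]
    return 1.0 - eigenvalues[1]

def restrict(P, block):
    """Restriction of P to `block`: moves leaving the block become holds."""
    R = P[np.ix_(block, block)].copy()
    R[np.diag_indices_from(R)] += 1.0 - R.sum(axis=1)
    return R

def project(P, mu, blocks):
    """Projection of P onto the partition `blocks` under stationary law mu."""
    J = len(blocks)
    P_bar = np.zeros((J, J))
    for a, A in enumerate(blocks):
        for b, B in enumerate(blocks):
            flow = sum(mu[x] * P[x, y] for x in A for y in B)
            P_bar[a, b] = flow / mu[A].sum()
    return P_bar

# Assumed example: M is symmetric and stochastic, so P = (I + M) / 2 is
# reversible with respect to the uniform law and nonnegative definite.
M = np.array([[0.5, 0.4, 0.1, 0.0],
              [0.4, 0.5, 0.0, 0.1],
              [0.1, 0.0, 0.5, 0.4],
              [0.0, 0.1, 0.4, 0.5]])
P = 0.5 * (np.eye(4) + M)
mu = np.full(4, 0.25)
blocks = [np.array([0, 1]), np.array([2, 3])]

gap_P = spectral_gap(P)
gap_bar = spectral_gap(project(P, mu, blocks))
gap_min = min(spectral_gap(restrict(P, A)) for A in blocks)
lower_bound = 0.5 * gap_bar * gap_min  # left-hand side of Theorem 5.2
```

In this example the upper bound is attained, because the slowest mode of `P` is exactly the movement between the two blocks.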

Theorem 5.3 [Diaconis and Saloff-Coste (1996)]. Take any $N \in \mathbb{N}$ and
let $P_k$, $k = 0, \ldots, N$, be $\mu_k$-reversible transition kernels on state spaces $\mathcal{X}_k$.
Let $P$ be the transition kernel on $\mathcal{X} = \prod_k \mathcal{X}_k$ given by

\[
P(x, dy) = \sum_{k=0}^{N} b_k P_k(x_{[k]}, dy_{[k]})\, \delta_{x_{[-k]}}(y_{[-k]})\, dy_{[-k]}, \qquad x, y \in \mathcal{X},
\]

for some set of $b_k > 0$ such that $\sum_k b_k = 1$ and where $\delta$ is Dirac's delta
function. $P$ is called a product chain. It is reversible with respect to
$\mu(dx) = \prod_k \mu_k(dx_{[k]})$, and

\[
\mathrm{Gap}(P) = \min_{k = 0, \ldots, N} b_k\, \mathrm{Gap}(P_k).
\]

Lemma 3.2 of [5] states this result for finite state spaces; however, the 
proof of that lemma holds in the general case. 
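
The product-chain formula can be checked numerically in the finite case. The sketch below is an illustration only: the two component chains `P0` and `P1` and the weights `b` are hypothetical, and the product chain is assembled with Kronecker products so that exactly one coordinate is updated per step.

```python
import numpy as np

def spectral_gap(P):
    """Return 1 minus the second-largest eigenvalue of a reversible chain."""
    eigenvalues = np.sort(np.linalg.eigvals(P).real)[::-1]
    return 1.0 - eigenvalues[1]

# Hypothetical reversible component chains on 2 and 3 states.
P0 = np.array([[0.9, 0.1],
               [0.2, 0.8]])
P1 = np.array([[0.50, 0.50, 0.00],
               [0.25, 0.50, 0.25],
               [0.00, 0.50, 0.50]])
b = np.array([0.3, 0.7])  # assumed weights, positive and summing to 1

# With probability b[k], update coordinate k and leave the other unchanged.
P = b[0] * np.kron(P0, np.eye(3)) + b[1] * np.kron(np.eye(2), P1)

gap_product = spectral_gap(P)
gap_formula = min(b[0] * spectral_gap(P0), b[1] * spectral_gap(P1))
```

The two quantities agree up to floating-point error, because the eigenvalues of the product chain are the sums $b_0 \lambda + b_1 \nu$ over eigenvalues $\lambda$ of `P0` and $\nu$ of `P1`.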

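Lemma 5.1. Let $\mu_P = \mu_Q$. If $Q(x, A \setminus \{x\}) \leq P(x, A \setminus \{x\})$ for every
$x \in \mathcal{X}$ and every $A \subset \mathcal{X}$, then $\mathrm{Gap}(Q) \leq \mathrm{Gap}(P)$.

Proof. As in [14], write $\mathrm{Gap}(P)$ in the form

\[
\mathrm{Gap}(P) = \inf_{f \in L_2(\mu_P):\ \mathrm{Var}_{\mu_P}(f) > 0} \frac{\iint |f(x) - f(y)|^2\, \mu_P(dx) P(x, dy)}{2\, \mathrm{Var}_{\mu_P}(f)}
\]

and write $\mathrm{Gap}(Q)$ analogously. The result then follows immediately. □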



6. Proof of Theorem 3.1. 



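6.1. Overview of the proof. As in Madras and Zheng [16], consider the
space $\Sigma = \mathbb{Z}_J^{N+1}$ of possible assignments of levels to partition elements. For
$x = (x_{[0]}, \ldots, x_{[N]}) \in \mathcal{X}_{sc}$, let the signature $s(x)$ be the vector
$(\sigma_0, \ldots, \sigma_N) \in \Sigma$ with

\[
\sigma_k = j \quad \text{if } x_{[k]} \in A_j \qquad (0 \leq k \leq N),
\]

and for $\sigma \in \Sigma$, define

\[
\mathcal{X}_\sigma = \{x \in \mathcal{X}_{sc} : s(x) = \sigma\},
\]

so $s$ induces a partition of $\mathcal{X}_{sc}$. Define $P_\sigma = P_{sc}|_{\mathcal{X}_\sigma}$ and let $\bar{P}_{sc}$ be the
projection matrix of $P_{sc}$ with respect to the partition $\{\mathcal{X}_\sigma\}$. Since $P_{sc}$ is
nonnegative definite, Theorem 5.2 gives

\[
\mathrm{Gap}(P_{sc}) \geq \frac{1}{2}\, \mathrm{Gap}(\bar{P}_{sc}) \min_{\sigma \in \Sigma} \mathrm{Gap}(P_\sigma). \tag{14}
\]

Theorem 3.1 then follows by deriving bounds on $\mathrm{Gap}(\bar{P}_{sc})$ and $\mathrm{Gap}(P_\sigma)$.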
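
To make the signature map concrete, here is a minimal sketch. The two-element partition of the real line by sign and the sample states are hypothetical; only the rule "record, for each level, which partition element the component lies in" comes from the text.

```python
from collections import Counter

def signature(x, partition_index):
    """Map a state x = (x[0], ..., x[N]) to its signature (sigma_0, ..., sigma_N)."""
    return tuple(partition_index(component) for component in x)

def sign_partition(z):
    # Hypothetical partition of the real line: A_1 = (-inf, 0), A_2 = [0, inf).
    return 1 if z < 0 else 2

# Three example states of a chain with N + 1 = 3 temperature levels.
states = [(-1.2, 0.3, 2.0), (-0.4, 1.1, 0.7), (0.9, -2.0, -0.1)]
signatures = [signature(x, sign_partition) for x in states]

# States sharing a signature lie in the same element of the induced partition.
counts = Counter(signatures)
```

Here the first two states share the signature (1, 2, 2) and so belong to the same set $\mathcal{X}_\sigma$, while the third has signature (2, 1, 1).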



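6.2. Bounding the spectral gap of $P_\sigma$. For $\sigma \in \Sigma$, consider the mixing
of $P_{sc}$ when restricted to the set $\mathcal{X}_\sigma$. If each of the chains $T_k$ mixes well
when restricted to the set $A_{\sigma_k}$, then the product chain $T$, and thus $P_{sc}$, will
also mix well when restricted to $\mathcal{X}_\sigma$. Let $T_\sigma = T|_{\mathcal{X}_\sigma}$ and note that for any
$x, y \in \mathcal{X}_\sigma$ with $x \neq y$,

\[
T_\sigma(x, dy) = \frac{1}{2(N+1)} \sum_{k=0}^{N} T_k|_{A_{\sigma_k}}(x_{[k]}, dy_{[k]})\, \delta_{x_{[-k]}}(y_{[-k]})\, dy_{[-k]}.
\]

Therefore, $T_\sigma$ is a product chain, and Theorem 5.3 provides its spectral gap:

\[
\mathrm{Gap}(T_\sigma) = \frac{1}{2(N+1)} \min_{k \in \{0, \ldots, N\}} \mathrm{Gap}(T_k|_{A_{\sigma_k}})
\geq \frac{1}{2(N+1)} \min_{\substack{k \in \{0, \ldots, N\} \\ j \in \{1, \ldots, J\}}} \mathrm{Gap}(T_k|_{A_j}).
\]

Note that since $P_{sc} = QTQ$, and since $Q$ has a 1/2 holding probability,

\[
P_\sigma(x, dy) \geq \tfrac{1}{4}\, T_\sigma(x, dy) \qquad \forall x, y \in \mathcal{X}_\sigma.
\]

Using Lemma 5.1 we have $\mathrm{Gap}(P_\sigma) \geq \mathrm{Gap}(T_\sigma)/4$. Therefore

\[
\mathrm{Gap}(P_\sigma) \geq \frac{1}{8(N+1)} \min_{\substack{k \in \{0, \ldots, N\} \\ j \in \{1, \ldots, J\}}} \mathrm{Gap}(T_k|_{A_j}). \tag{15}
\]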

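6.3. Bounding the spectral gap of $\bar{P}_{sc}$. First note that $\bar{P}_{sc}$ is reversible
with respect to the probability mass function

\[
\pi^*(\sigma) = \prod_{k=0}^{N} \pi_k[A_{\sigma_k}], \qquad \sigma \in \Sigma.
\]

For any $\sigma, \tau \in \Sigma$, $\bar{P}_{sc}(\sigma, \tau)$ is the conditional probability at stationarity of
moving to $\mathcal{X}_\tau$ under $P_{sc}$, given that the chain is currently in $\mathcal{X}_\sigma$:

\[
\bar{P}_{sc}(\sigma, \tau) = \frac{1}{\pi^*(\sigma)} \int_{\mathcal{X}_\sigma} \int_{\mathcal{X}_\tau} \pi_{sc}(x) P_{sc}(x, dy)\, \lambda_{sc}(dx).
\]

We will begin by bounding this probability in terms of the probability of
moving to $\mathcal{X}_\tau$ under $Q$ (a swap move) and the probability of moving to $\mathcal{X}_\tau$
under $T$ (an update move). For swap moves, let $\bar{Q}$ be the projection matrix
of $Q$ with respect to $\{\mathcal{X}_\sigma : \sigma \in \Sigma\}$. Then for $k \in \{0, \ldots, N-1\}$, we have

\[
\bar{P}_{sc}(\sigma, (k, k+1)\sigma) \geq \tfrac{1}{4}\, \bar{Q}(\sigma, (k, k+1)\sigma) \qquad \forall \sigma,
\]

where the right-hand side is the conditional probability of swapping $x_{[k]}$
and $x_{[k+1]}$ under $Q$, and then holding twice. Similarly, for update moves we
denote by $\bar{T}$ the projection matrix of $T$ with respect to $\{\mathcal{X}_\sigma : \sigma \in \Sigma\}$, and
denote $\sigma_{[i,j]} = (\sigma_0, \ldots, \sigma_{i-1}, j, \sigma_{i+1}, \ldots, \sigma_N)$. Then

\[
\bar{P}_{sc}(\sigma, \sigma_{[i,j]}) \geq \tfrac{1}{4}\, \bar{T}(\sigma, \sigma_{[i,j]}) \qquad \forall i, j.
\]

Therefore, the Dirichlet form $\bar{\mathcal{E}}_{sc}$ of $\bar{P}_{sc}$ evaluated at $f \in L_2(\pi^*)$ satisfies

\[
\bar{\mathcal{E}}_{sc}(f, f) \geq \tfrac{1}{4}\, \mathcal{E}_{\bar{Q}}(f, f) + \tfrac{1}{4}\, \mathcal{E}_{\bar{T}}(f, f).
\]
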
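Recalling that $\bar{T}_0$ is the projection matrix of $T_0$ with respect to $\mathcal{A}$, note that

\begin{align*}
4(N+1)\, \mathcal{E}_{\bar{T}}(f, f)
&= 2(N+1) \sum_{\sigma \in \Sigma} \sum_{k=0}^{N} \sum_{j=1}^{J} \bigl(f(\sigma) - f(\sigma_{[k,j]})\bigr)^2 \pi^*(\sigma)\, \bar{T}(\sigma, \sigma_{[k,j]}) \\
&\geq 2(N+1) \sum_{\sigma \in \Sigma} \sum_{j=1}^{J} \bigl(f(\sigma) - f(\sigma_{[0,j]})\bigr)^2 \pi^*(\sigma)\, \bar{T}(\sigma, \sigma_{[0,j]}) \\
&= \sum_{\sigma_{1:N}} \left[ \prod_{k=1}^{N} \pi_k[A_{\sigma_k}] \right] \sum_{i=1}^{J} \sum_{j=1}^{J} \bigl(f((i, \sigma_{1:N})) - f((j, \sigma_{1:N}))\bigr)^2 \pi_0[A_i]\, \bar{T}_0(i, j) \\
&\geq \mathrm{Gap}(\bar{T}_0) \sum_{\sigma_{1:N}} \left[ \prod_{k=1}^{N} \pi_k[A_{\sigma_k}] \right] \sum_{i=1}^{J} \sum_{j=1}^{J} \bigl(f((i, \sigma_{1:N})) - f((j, \sigma_{1:N}))\bigr)^2 \pi_0[A_i]\, \pi_0[A_j],
\end{align*}

where the second inequality is by recognizing the Dirichlet form for $\bar{T}_0$.
Therefore,

\[
\frac{\bar{\mathcal{E}}_{sc}(f, f)}{\mathrm{Gap}(\bar{T}_0)} \geq \frac{1}{8}\, \mathcal{E}_{\bar{Q}}(f, f) + \frac{1}{16(N+1)} \sum_{\sigma \in \Sigma} \sum_{j=1}^{J} \bigl(f(\sigma) - f(\sigma_{[0,j]})\bigr)^2 \pi^*(\sigma)\, \pi_0[A_j]. \tag{16}
\]
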
Now consider a transition kernel $T^*$ constructed as follows: with probability
$\frac{1}{2}$ transition according to $\bar{Q}$, or (exclusively) with probability $\frac{1}{2(N+1)}$ draw
$\sigma_0$ according to the distribution $\{\pi_0[A_j] : j = 1, \ldots, J\}$ (i.e., independent
samples at the highest temperature); otherwise hold. Note that the Dirichlet
form of $T^*$ is exactly four times the right-hand side of (16). Clearly, $T^*$
is also reversible with respect to $\pi^*$, so $\bar{P}_{sc}$ and $T^*$ have the same stationary
distribution. Therefore,

\[
\mathrm{Gap}(\bar{P}_{sc}) \geq \frac{\mathrm{Gap}(T^*)\, \mathrm{Gap}(\bar{T}_0)}{4}. \tag{17}
\]


We will now bound $\mathrm{Gap}(T^*)$ by comparison with another $\pi^*$-reversible
chain. Define the transition matrix $T^{**}$ which chooses $k$ uniformly from
$\{0, \ldots, N\}$ and then draws $\sigma_k$ according to the distribution
$\{\pi_k[A_j] : j = 1, \ldots, J\}$. Clearly, $T^{**}$ moves easily among the elements of $\Sigma$, and
consequently has a large spectral gap as we will see. By combining (17) with a
comparison of $T^*$ to $T^{**}$, we will obtain a lower bound on the spectral gap
of $\bar{P}_{sc}$.

Comparison of $T^*$ to $T^{**}$ will be done using Theorem 5.1. To simplify
notation, we write $\pi_k(j)$ as shorthand for $\pi_k[A_j]$ for the remainder of this
section. Let $j^*$ be the value of $j$ that maximizes $\pi_N(j)$. For each edge $(\sigma, \sigma_{[i,j]})$
in the graph of $T^{**}$, we define a path $\gamma_{\sigma, \sigma_{[i,j]}}$ in $T^*$ with the following 7
stages:
1. Change $\sigma_0$ to $j^*$.

2. Swap that $j^*$ "up" to level $i$.

3. Take the new $\sigma_{i-1}$ (formerly $\sigma_i$) and swap it "down" to level 0.

4. Change the value at level 0 to $j$ (from former $\sigma_i$).

5. Swap the $j$ at level 0 "up" to level $i$.

6. Swap the $j^*$ that is now at level $i - 1$ "down" to level 0.

7. Change the value at level 0 to $\sigma_0$ (from $j^*$).

In each path, skip all steps that do not change the state. Using the defined
path set, we will obtain an upper bound $c^*$ on the quantity $c$ in Theorem 5.1.
Since $T^*$ and $T^{**}$ both have stationary distribution $\pi^*$, Theorem 5.1 then
yields $\mathrm{Gap}(T^*) \geq \frac{1}{c^*} \mathrm{Gap}(T^{**})$. To obtain such an upper bound $c^*$, we will
use Propositions 6.1 and 6.2.
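
The seven stages can be sketched in code as follows. This is an illustration under stated assumptions rather than part of the proof: the function name `tempering_path` is hypothetical, signatures are represented as tuples indexed by level, and the case $i = 0$, which the stage description does not cover cleanly, is treated here as a single redraw at level 0.

```python
def tempering_path(sigma, i, j, j_star):
    """List the signatures visited on the path from sigma to sigma_[i,j]."""
    state = list(sigma)
    path = [tuple(state)]

    def record():
        # Skip all steps that do not change the state.
        if tuple(state) != path[-1]:
            path.append(tuple(state))

    def set_level0(value):
        state[0] = value
        record()

    def swap_up(start, stop):
        # Move the entry at level `start` up to level `stop` by adjacent swaps.
        for k in range(start, stop):
            state[k], state[k + 1] = state[k + 1], state[k]
            record()

    def swap_down(start, stop):
        # Move the entry at level `start` down to level `stop` by adjacent swaps.
        for k in range(start, stop, -1):
            state[k], state[k - 1] = state[k - 1], state[k]
            record()

    if i == 0:
        # Assumption: a change at level 0 is made directly in one step.
        set_level0(j)
        return path

    original_level0 = sigma[0]
    set_level0(j_star)           # stage 1
    swap_up(0, i)                # stage 2
    swap_down(i - 1, 0)          # stage 3
    set_level0(j)                # stage 4
    swap_up(0, i)                # stage 5
    swap_down(i - 1, 0)          # stage 6
    set_level0(original_level0)  # stage 7
    return path

# Example with N = 3 (four levels), J = 3 and an assumed j* = 2.
example_path = tempering_path((1, 2, 3, 1), i=2, j=1, j_star=2)
```

Every consecutive pair of signatures in the returned list differs either by a change at level 0 or by an adjacent transposition, which are the two kinds of moves available to $T^*$. Before skipped steps are removed, the path has $4i + 1$ steps, within the length bound $4N + 3$ used in Proposition 6.2.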



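Proposition 6.1. For the above-defined paths,

\[
\frac{\pi^*(\sigma)\, T^{**}(\sigma, \sigma_{[i,j]})}{\pi^*(\tau)\, T^*(\tau, \xi)} \leq 4J\, \gamma(\mathcal{A})^{-(J+3)}\, \delta(\mathcal{A})^{-2} \tag{18}
\]

for all $\sigma$, $i$ and $j$, and any edge $(\tau, \xi)$ in $\gamma_{\sigma, \sigma_{[i,j]}}$.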

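Proof. To obtain (18), first note that

\[
\pi^*(\sigma)\, T^{**}(\sigma, \sigma_{[i,j]})
= \frac{1}{N+1} \left[ \prod_{k=0}^{N} \pi_k(\sigma_k) \right] \pi_i(j)
= \frac{1}{N+1} \min\{\pi^*(\sigma), \pi^*(\sigma_{[i,j]})\} \max\{\pi_i(\sigma_i), \pi_i(j)\}.
\]

For any state $\tau$ in the path $\gamma_{\sigma, \sigma_{[i,j]}}$, we will find a lower bound on $\pi^*(\tau)$ in
terms of $\min\{\pi^*(\sigma), \pi^*(\sigma_{[i,j]})\}$. Consider the states in the path $\gamma_{\sigma, \sigma_{[i,j]}}$ up
to stage 4 (where $\sigma_i$ is at level 0). We will show that each such state $\tau$ satisfies
$\pi^*(\tau) \geq \pi^*(\sigma)\, \gamma(\mathcal{A})^{J+2} J^{-1}$. Then by symmetry, the states from stage 4 to
the end of the path satisfy $\pi^*(\tau) \geq \pi^*(\sigma_{[i,j]})\, \gamma(\mathcal{A})^{J+2} J^{-1}$.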

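Any state in stages 1 or 2 of the path from $\sigma$ to $\sigma_{[i,j]}$ is of the form
$\tau = (\sigma_1, \ldots, \sigma_l, j^*, \sigma_{l+1}, \ldots, \sigma_N)$ for some $l \in \{0, \ldots, i\}$. Therefore,

\begin{align*}
\pi^*(\tau)
&= \pi^*(\sigma) \left[ \prod_{k=1}^{l} \frac{\pi_{k-1}(\sigma_k)}{\pi_k(\sigma_k)} \right] \frac{\pi_l(j^*)}{\pi_0(\sigma_0)} \\
&= \pi^*(\sigma) \left[ \prod_{k=1}^{l} \prod_{m=1}^{J} \left( \mathbf{1}(\sigma_k = m) \frac{\pi_{k-1}(m)}{\pi_k(m)} + \mathbf{1}(\sigma_k \neq m) \right) \right] \frac{\pi_l(j^*)}{\pi_0(\sigma_0)} \\
&\geq \pi^*(\sigma) \left[ \prod_{m=1}^{J} \prod_{k=1}^{N} \min\left\{ 1, \frac{\pi_{k-1}(m)}{\pi_k(m)} \right\} \right] \frac{\pi_l(j^*)}{\pi_0(\sigma_0)} \\
&\geq \pi^*(\sigma)\, \gamma(\mathcal{A})^{J+1} J^{-1},
\end{align*}

where the last inequality uses the fact that by definition $\pi_N(j^*) \geq J^{-1}$, so
$\pi_k(j^*) \geq \gamma(\mathcal{A}) J^{-1}$ for all $k$ and

\[
\frac{\pi_l(j^*)}{\pi_0(\sigma_0)} \geq \frac{\gamma(\mathcal{A}) J^{-1}}{\pi_0(\sigma_0)} \geq \gamma(\mathcal{A}) J^{-1}.
\]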

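Now consider the states in stage 3 of the path, the last of which is also the
first state in stage 4. Any such state $\tau$ is of the form

\[
\tau = (\sigma_1, \ldots, \sigma_l, \sigma_i, \sigma_{l+1}, \ldots, \sigma_{i-1}, j^*, \sigma_{i+1}, \ldots, \sigma_N)
\]

for some $l \in \{0, \ldots, i-1\}$. Therefore,

\[
\pi^*(\tau) = \pi^*(\sigma) \left[ \prod_{k=1}^{l} \frac{\pi_{k-1}(\sigma_k)}{\pi_k(\sigma_k)} \right] \frac{\pi_l(\sigma_i)}{\pi_i(\sigma_i)} \cdot \frac{\pi_i(j^*)}{\pi_0(\sigma_0)}
\geq \pi^*(\sigma)\, \gamma(\mathcal{A})^{J+2} J^{-1},
\]

where the last step is because $l < i$. Putting the above together, we have
that for all $\tau$ in the path from $\sigma$ to $\sigma_{[i,j]}$

\[
\pi^*(\tau) \geq \min\{\pi^*(\sigma), \pi^*(\sigma_{[i,j]})\}\, \gamma(\mathcal{A})^{J+2} J^{-1}. \tag{19}
\]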

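We use this to obtain (18) as follows: for any edge $(\tau, \xi)$ on the path
$\gamma_{\sigma, \sigma_{[i,j]}}$, we have either $\xi = (k, k+1)\tau$ for some $k$, or $\xi = \tau_{[0,m]}$ for some $m$.
The probability of proposing the swap $\xi = (k, k+1)\tau$ according to $Q$ is $1/(2N)$.
Recall that for any $l_1, l_2 \in \{1, \ldots, J\}$, $\delta(\mathcal{A})^2$ is a lower bound for the marginal
probability at stationarity of accepting a proposed swap between $x_k \in A_{l_1}$
and $x_{k+1} \in A_{l_2}$. Thus, we have $\bar{Q}(\tau, \xi) \geq \delta(\mathcal{A})^2 / (2N)$, and so

\begin{align*}
\frac{\pi^*(\sigma)\, T^{**}(\sigma, \sigma_{[i,j]})}{\pi^*(\tau)\, T^*(\tau, \xi)}
&= \frac{\pi^*(\sigma)\, \pi_i(j)}{(N+1)\, \pi^*(\tau)\, T^*(\tau, \xi)} \\
&= \frac{2\, \pi^*(\sigma)\, \pi_i(j)}{(N+1)\, \pi^*(\tau)\, \bar{Q}(\tau, \xi)}
\leq \frac{4\, \pi^*(\sigma)\, \pi_i(j)}{\pi^*(\tau)\, \delta(\mathcal{A})^2} \\
&= \frac{4 \min\{\pi^*(\sigma), \pi^*(\sigma_{[i,j]})\} \max\{\pi_i(j), \pi_i(\sigma_i)\}}{\pi^*(\tau)\, \delta(\mathcal{A})^2} \tag{20} \\
&\leq \frac{4J \max\{\pi_i(j), \pi_i(\sigma_i)\}}{\gamma(\mathcal{A})^{J+2}\, \delta(\mathcal{A})^2}
\leq \frac{4J}{\gamma(\mathcal{A})^{J+2}\, \delta(\mathcal{A})^2}.
\end{align*}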

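In the case that instead $\xi = \tau_{[0,m]}$ for some $m$, (20) becomes

\[
\frac{\pi^*(\sigma)\, T^{**}(\sigma, \sigma_{[i,j]})}{\pi^*(\tau)\, T^*(\tau, \xi)} = \frac{2\, \pi^*(\sigma)\, \pi_i(j)}{\pi^*(\tau)\, \pi_0(m)} \tag{21}
\]

and there are three possible cases: the edge $(\tau, \xi)$ could be stage 1, stage 4
or stage 7 of $\gamma_{\sigma, \sigma_{[i,j]}}$. If it is stage 1, then (21) is bounded by

\[
\frac{2\, \pi^*(\sigma)\, \pi_i(j)}{\pi^*(\sigma)\, \pi_0(j^*)} \leq \frac{2}{\pi_0(j^*)} \leq \frac{2J}{\gamma(\mathcal{A})}.
\]

If the move is stage 4, then (21) is bounded by

\begin{align*}
\frac{2\, \pi^*(\sigma)\, \pi_i(j)}{\pi^*(\tau)\, \pi_0(j)}
&= \frac{2 \min\{\pi^*(\sigma), \pi^*(\sigma_{[i,j]})\} \max\{\pi_i(j), \pi_i(\sigma_i)\}}{\min\{\pi^*(\tau), \pi^*(\xi)\} \max\{\pi_0(j), \pi_0(\sigma_i)\}} \\
&\leq \frac{2 \min\{\pi^*(\sigma), \pi^*(\sigma_{[i,j]})\}}{\gamma(\mathcal{A}) \min\{\pi^*(\tau), \pi^*(\xi)\}}
\leq \frac{2J}{\gamma(\mathcal{A})^{J+3}}
\end{align*}

by (19) and since $\pi_i(j) \leq \frac{\pi_0(j)}{\gamma(\mathcal{A})} \leq \frac{\max\{\pi_0(j), \pi_0(\sigma_i)\}}{\gamma(\mathcal{A})}$ and
$\pi_i(\sigma_i) \leq \frac{\pi_0(\sigma_i)}{\gamma(\mathcal{A})} \leq \frac{\max\{\pi_0(j), \pi_0(\sigma_i)\}}{\gamma(\mathcal{A})}$.
Finally, if the move is stage 7, then (21) is bounded by

\[
\frac{2\, \pi^*(\sigma)\, \pi_i(j)}{\pi^*(\sigma_{[i,j]})\, \pi_0(j^*)} = \frac{2\, \pi^*(\sigma_{[i,j]})\, \pi_i(\sigma_i)}{\pi^*(\sigma_{[i,j]})\, \pi_0(j^*)} \leq \frac{2J}{\gamma(\mathcal{A})}.
\]

The result (18) follows for any edge $(\tau, \xi)$ on the path from $\sigma$ to $\sigma_{[i,j]}$. □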

Proposition 6.2. For the above-defined paths,

\[
\sum_{\gamma_{\sigma, \sigma_{[i,j]}} \ni (\tau, \xi)} |\gamma_{\sigma, \sigma_{[i,j]}}| \leq 16 (N+1)^2 J^2 \tag{22}
\]

for any edge $(\tau, \xi)$ in the graph of $T^*$.

Proof. We will bound the number of paths $\gamma_{\sigma, \sigma_{[i,j]}}$ that go through any
edge $(\tau, \xi)$, and the length of any such path.

Consider the set of paths for which the edge is in stage 1 of the path.
Then $\tau = \sigma$ and $\xi = \sigma_{[0, j^*]}$, and since $i \in \{0, \ldots, N\}$ and $j \in \{1, \ldots, J\}$, there
are no more than $(N+1)J$ such paths. Similarly, there are no more than
$(N+1)J$ paths for which the edge is in stage 4 of the path.

Now consider the set of paths for which the edge is in stage 2 of the path.
Then we must have $\tau = (\sigma_1, \ldots, \sigma_l, j^*, \sigma_{l+1}, \ldots, \sigma_N)$ for some $l \in \{0, \ldots, i-1\}$
and $\xi = (l, l+1)\tau$. $\sigma_0$ is unknown but has only $J$ possible values, so with $i$,
$j$ unknown there are no more than $(N+1)J^2$ such paths. Similarly, there
are no more than $(N+1)J^2$ paths for which the edge is in stage 3 of the
path.

If the edge has $\xi = (k, k+1)\tau$ for some $k$, then it can only be in stages
2, 3, 5 or 6 of the path, while if $\xi = \tau_{[0,m]}$ for some $m$ then it can only be in
stages 1, 4 or 7. Since the edge can be in at most 4 stages, each with at most
$(N+1)J^2$ paths, the total number of paths containing any edge is no more
than $4(N+1)J^2$. Each of these paths has length at most $4N + 3 < 4(N+1)$,
so (22) follows. □

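Combining Propositions 6.1 and 6.2, we obtain an upper bound on the
constant $c$ in Theorem 5.1:

\[
c \leq \frac{2^6 (N+1)^2 J^3}{\gamma(\mathcal{A})^{J+3}\, \delta(\mathcal{A})^2},
\]

and recalling that both $T^*$ and $T^{**}$ have stationary distribution $\pi^*$,
application of Theorem 5.1 yields

\[
\mathrm{Gap}(T^*) \geq \frac{\gamma(\mathcal{A})^{J+3}\, \delta(\mathcal{A})^2}{2^6 (N+1)^2 J^3}\, \mathrm{Gap}(T^{**}).
\]

Now since $T^{**}$ is a product chain whose $\pi_k$-reversible component chains
each have spectral gap 1 by definition (1), Theorem 5.3 gives $\mathrm{Gap}(T^{**}) = (N+1)^{-1}$ and we have

\[
\mathrm{Gap}(T^*) \geq \frac{\gamma(\mathcal{A})^{J+3}\, \delta(\mathcal{A})^2}{2^6 (N+1)^3 J^3}.
\]

Then we obtain the bound for $\mathrm{Gap}(\bar{P}_{sc})$ from (17):

\[
\mathrm{Gap}(\bar{P}_{sc}) \geq \frac{\mathrm{Gap}(T^*)\, \mathrm{Gap}(\bar{T}_0)}{4} \geq \left( \frac{\gamma(\mathcal{A})^{J+3}\, \delta(\mathcal{A})^2}{2^8 (N+1)^3 J^3} \right) \mathrm{Gap}(\bar{T}_0). \tag{23}
\]

Using (14), (15) and (23) then proves Theorem 3.1.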
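
For bookkeeping, the constants in (14), (15) and (23) can be multiplied out as below. This is only a sketch of the arithmetic, reading the three displays as a factor $\frac{1}{2}$ from (14), a factor $\frac{1}{8(N+1)}$ from (15) and a factor $2^{-8}(N+1)^{-3}J^{-3}$ from (23); the statement of Theorem 3.1 remains the authoritative form of the bound.

\[
\mathrm{Gap}(P_{sc})
\geq \frac{1}{2} \cdot \frac{\gamma(\mathcal{A})^{J+3}\, \delta(\mathcal{A})^2}{2^8 (N+1)^3 J^3}\, \mathrm{Gap}(\bar{T}_0) \cdot \frac{1}{8(N+1)} \min_{k,j} \mathrm{Gap}(T_k|_{A_j})
= \frac{\gamma(\mathcal{A})^{J+3}\, \delta(\mathcal{A})^2}{2^{12} (N+1)^4 J^3}\, \mathrm{Gap}(\bar{T}_0) \min_{k,j} \mathrm{Gap}(T_k|_{A_j}).
\]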

As we have seen, Theorem 3.1 bounds the mixing of the swapping chain in 
terms of its mixing within each partition element and its mixing among the 
partition elements. For several multimodal examples, we have given (inverse) 
temperature specifications which guarantee that the four quantities in the 
bound are large, and used Theorem 3.1 and Corollary 3.1 to prove rapid 
mixing of parallel and simulated tempering. 

APPENDIX: PROOF OF THEOREM 5.2 

The upper bound in Theorem 5.2 is shown in [14]. The lower bound uses 
the following results; consider the context of Section 5.2. 

Theorem A.1 [Caracciolo, Pelissetto and Sokal (1992)]. Let $\mu_P = \mu_Q$.
Assume that $P$ is nonnegative definite and let $P^{1/2}$ be its nonnegative square
root. Then

\[
\mathrm{Gap}(P^{1/2} Q P^{1/2}) \geq \mathrm{Gap}(\bar{P}) \min_{j = 1, \ldots, J} \mathrm{Gap}(Q|_{A_j}).
\]



This result is due to Caracciolo, Pelissetto and Sokal [3], but first appeared 
in Madras and Randall [14]. 
Lemma A.1 [Madras and Zheng (2003)].

\[
\mathrm{Gap}(P) \geq \frac{1}{n}\, \mathrm{Gap}(P^n) \qquad \forall n \in \mathbb{N}.
\]

Note that although [16] state this result for finite state spaces, their proof 
extends easily to general spaces. 

Lemma A.2 [Madras and Zheng (2003)]. Assume that $\mu_P = \mu_Q$ and that
$P$ is nonnegative definite. Then

\[
\mathrm{Gap}(QPQ) \geq \mathrm{Gap}(P).
\]

Now consider the context of Theorem 5.2. The lower bound in Theorem 5.2
follows directly from Theorem A.1 and Lemma A.1:

\[
\mathrm{Gap}(P) \geq \frac{1}{2}\, \mathrm{Gap}(P^2) = \frac{1}{2}\, \mathrm{Gap}(P^{1/2} P P^{1/2}) \geq \frac{1}{2}\, \mathrm{Gap}(\bar{P}) \min_{j} \mathrm{Gap}(P|_{A_j}).
\]